2023-10-18
Broadly, web scraping is the process of automatically pulling data from websites by reading their underlying code. Doing this gets complicated fast:
It is difficult to predict how much time a web scraping task will take.
Sites might change, requiring updates to the scraping code.
Site maintainers may not be okay with data being scraped. Quick plug for the TECH team’s Automated Data Guidelines.
Since Spring 2023, states have been disenrolling Medicaid beneficiaries who no longer qualify now that the Public Health Emergency has ended.
In anticipation of “the great unwinding,” many states implemented policy changes to smooth the transition.
To understand the success of these policies, we wanted time-series enrollment data for all 50 states… from a Medicaid data system that is largely decentralized.
Why page through PDFs when another organization’s RAs can do it for you?
One URL with data you can only get by clicking each option!
Whenever new data were released in the following 2 months, I re-ran the code and got a well-formatted Excel file as output.
2 months later, KFF stopped updating the dashboard and changed how existing data were reported on graphs.
Work-based learning (WBL) can include internships, clinicals, co-ops, and other opportunities to gain experience in a work setting.
These opportunities are especially helpful to community college students.
We sought to create a national dataset of WBL prevalence by scraping course descriptions.
Starting from a homepage, we used Scrapy to follow links containing keywords like “catalog” or “course descriptions.”
For each link, we scraped basic metadata and all the text present.
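The link-following logic above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses Python's standard-library html.parser instead of Scrapy so it runs without dependencies, and the names KEYWORDS, LinkAndTextParser, and scrape_page, as well as the example URL, are hypothetical. It assumes a keyword can match either the link's URL or its anchor text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; the original names only these two examples.
KEYWORDS = ("catalog", "course descriptions")


class LinkAndTextParser(HTMLParser):
    """Collects (absolute URL, anchor text) pairs and the visible text of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # list of (url, anchor_text)
        self.text_parts = []   # visible text fragments
        self._href = None      # href of the <a> currently open, if any
        self._anchor_text = []
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._anchor_text = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor_text)))
            self._href = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore JavaScript and CSS
        chunk = data.strip()
        if chunk:
            self.text_parts.append(chunk)
            if self._href is not None:
                self._anchor_text.append(chunk)


def scrape_page(url, html, keywords=KEYWORDS):
    """Return basic metadata, all visible text, and the links worth following."""
    parser = LinkAndTextParser(url)
    parser.feed(html)
    parser.close()
    follow = [
        link
        for link, text in parser.links
        if any(k in link.lower() or k in text.lower() for k in keywords)
    ]
    return {"url": url, "text": " ".join(parser.text_parts), "follow": follow}


# Usage with an inline page, so nothing is fetched over the network.
sample_html = (
    "<html><body><h1>Welcome</h1>"
    '<a href="/catalog/2023">Academics</a>'
    '<a href="/about">About us</a>'
    '<a href="/cd">Course Descriptions</a>'
    "<script>var x = 1;</script>"
    "</body></html>"
)
result = scrape_page("https://college.example.edu/", sample_html)
```

In a real crawl, each URL in result["follow"] would be fetched and passed back through scrape_page. Matching keywords this loosely is part of why the output can get messy: any page whose link happens to mention "catalog" gets followed, whether or not it holds course descriptions.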
After a lot of hard work refining our approach, our data was a hot mess: